%\input{common/conf_top.tex}
\input{common/conf_top_print.tex}  %settings for printed booklets - comment out by default, uncomment for print and comment out line above. don't save this change! "conf_top" should be default

\input{common/conf_titles.tex}
\input{generated_buildinfo.tex}

\lstdefinestyle{pythoncode}{
  language=python,
  frame=single,
  breaklines,
  basicstyle=\ttfamily\scriptsize,
  commentstyle=\textsl,% comment style
  keywordstyle=\ttfamily\scriptsize,
  numbers=left,% display line numbers on the left side
  numberstyle=\scriptsize,% use small line numbers
  numbersep=10pt,% space between line numbers and code
  backgroundcolor=\color{white},
  showstringspaces=false %don't show spaces as weird char.
}

\begin{document}

\input{common/conf_listings.tex} %see note for `conf_top_print.tex` above
%\input{common/conf_listings_colorized.tex}  % Work in progress....  Use this for online/pdf version


\thispagestyle{empty} %removes page number
\newgeometry{bmargin=0cm, hmargin=0cm}


\begin{center}
\textsc{\Large\bf{Machine Learning with Python and H2O}}

\bigskip
\line(1,0){250}  %inserts  horizontal line 
\\
\bigskip
\small
\textsc{Spencer Aiello \hspace{10pt} Cliff Click  \hspace{10pt} }

\textsc{Hank Roark  \hspace{10pt} Ludi Rehak}

\textsc{Edited by: Jessica Lanford}

\normalsize

\line(1,0){250}  %inserts  horizontal line

{\url{http://h2o.ai/resources/}}

\bigskip

\monthname \hspace{1pt}  \the\year: Second Edition 

\bigskip
\end{center}

% commenting out lines image due to print issues, but leaving in for later
%\null\vfill
%\begin{figure}[!b]
%\noindent\makebox[\textwidth]{%
%\centerline{\includegraphics[width=\paperwidth]{waves.png}}}
%\end{figure}

\newpage
\restoregeometry

\null\vfill %move next text block to lower left of new page

\thispagestyle{empty}%remove pg#

{\raggedright 

Machine Learning with Python and H2O\\
  by Spencer Aiello, Cliff Click, \\ Hank Roark \&\  Ludi Rehak\\
Edited by: Jessica Lanford
\bigskip

Published by H2O.ai, Inc. \\
2307 Leghorn St. \\
Mountain View, CA 94043\\
\bigskip
\textcopyright \the\year \hspace{1pt} H2O.ai, Inc. All Rights Reserved. 
\bigskip

\monthname \hspace{1pt}  \the\year: Second Edition 
\bigskip

Photos by \textcopyright H2O.ai, Inc.
\bigskip

All copyrights belong to their respective owners.\\
While every precaution has been taken in the\\
preparation of this book, the publisher and\\
authors assume no responsibility for errors or\\
omissions, or for damages resulting from the\\
use of the information contained herein.\\
\bigskip
Printed in the United States of America. 
}


\newpage
\thispagestyle{empty}%remove pg#

\tableofcontents
\thispagestyle{empty}%remove pg#

\newpage

\section{Introduction}

This documentation describes how to use H2O from Python. More information on H2O's system and algorithms
(as well as complete Python user documentation) is available at the H2O website at {\url{http://docs.h2o.ai}}.

H2O Python uses a REST API to connect to H2O. To use H2O in Python or launch H2O from Python, specify the IP address and port number of the H2O instance in the Python environment. Datasets are not directly transmitted
through the REST API. Instead, commands (for example, importing a dataset at specified HDFS location) are sent either through the browser or the REST API to perform the specified task.
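
As a minimal sketch of what ``specifying the IP address and port number'' amounts to, the following plain-Python snippet builds the URL that a REST request to an H2O node would use. The default host and port (\texttt{localhost}, \texttt{54321}) and the \texttt{/3/Cloud} status path are assumptions made for illustration; check them against your own H2O instance. The helper names are hypothetical and are not part of the H2O client.

\begin{lstlisting}[style=pythoncode]
from urllib.parse import urlunsplit


def h2o_base_url(ip="localhost", port=54321):
    """Build the base URL of an H2O node (default values are assumptions)."""
    return urlunsplit(("http", "{}:{}".format(ip, port), "", "", ""))


def cluster_status_url(ip="localhost", port=54321):
    """Return the URL of the (assumed) cluster-status REST endpoint."""
    return h2o_base_url(ip, port) + "/3/Cloud"


# Only the URL is built here; no request is sent.
print(cluster_status_url())
\end{lstlisting}

In practice the H2O Python client sends these requests for you, so you normally pass only the address and port when connecting.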

The dataset is then assigned an identifier that is used as a reference in commands to the web server.
After you prepare the dataset for modeling by identifying the significant data and removing the insignificant data,
you use H2O to create a model that represents the results of the data analysis.
These models are assigned IDs that are used as references in commands.

Depending on the size of your data, H2O can run on your desktop or scale using multiple nodes with Hadoop,
an EC2 cluster, or Spark.  Hadoop is a scalable open-source framework
that uses clusters for distributed storage and dataset processing. H2O nodes run as JVM invocations on Hadoop
nodes. For performance reasons, we recommend that you do not run an H2O node on the same hardware as the Hadoop
NameNode.

H2O helps Python users make the leap from single-machine processing to large-scale distributed environments.
Hadoop lets H2O users scale their data processing capabilities based on their current needs.
Using H2O, Python, and Hadoop, you can create a complete end-to-end data analysis solution.

This document describes the four steps of data analysis with H2O:
\begin{enumerate}

\item installing H2O
\item preparing your data for modeling
\item creating a model using simple but powerful machine learning algorithms
\item scoring your models

\end{enumerate}

\newpage
\input{common/what_is_h2o.tex}
%Need to put into something about where H2O fits into the PyData / SciKit ecosystem.

\subsection{Example Code}

Python code for the examples in this document is located here:

\url{https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/booklets/v2_2015/source/python}

\subsection{Citation}

To cite this booklet, use the following: 

Aiello, S., Click, C., Roark, H., Rehak, L., and Lanford, J. (Nov. 2015) {\textit{Machine Learning with Python and H2O}}. {\url{http://h2o.ai/resources/}}. 


\section{Installation} 
\Urlmuskip=0mu plus 1mu\relax %needed to make long URLs break nicely

H2O requires Java; if you do not already have Java installed, install it from {\url{https://java.com/en/download/}} before installing H2O. 

The easiest way to directly install H2O is  via a Python package.

({\bf{Note}}: The examples in this document were created with H2O version \waterVersion.)

\subsection{Installation in Python}

To load a recent H2O package from PyPI, run:

\begin{lstlisting}[style=python]
pip install h2o
\end{lstlisting}

To download the
latest stable H2O-3 build from the H2O download page:

\begin{enumerate}
\item Go to {\url{http://h2o.ai/download}}.
\item Choose the latest stable H2O-3 build.
\item Click the ``Install in Python'' tab.
\item Copy and paste the commands into your Python session.
\end{enumerate}

After H2O is installed, verify the installation:

\begin{lstlisting}[style=python]
import h2o

# Start H2O on your local machine
h2o.init()

# Get help
help(h2o.estimators.glm.H2OGeneralizedLinearEstimator)
help(h2o.estimators.gbm.H2OGradientBoostingEstimator)

# Show a demo
h2o.demo("glm")
h2o.demo("gbm")
\end{lstlisting}


\section{Data Preparation}
The next sections of the booklet demonstrate the Python interface using examples, which include  short snippets of code and the
resulting output.  

In H2O, these operations are all distributed and run in
parallel, so they can be used on very large datasets.  More information about the
Python interface to H2O can be found at {\texttt{docs.h2o.ai}}.

Typically, we import and start H2O on the same machine as the running Python process:
\lstinputlisting[style=pythoncode, firstline=1, lastline=31]{python/ipython_dataprep_output_fixed.txt}

To connect to an established H2O cluster (in a multi-node Hadoop environment, for example):
\lstinputlisting[style=pythoncode, firstline=587, lastline=588]{python/ipython_dataprep_output_fixed.txt}

To create an H2OFrame object from a Python tuple:
\lstinputlisting[style=pythoncode, firstline=33, lastline=46]{python/ipython_dataprep_output_fixed.txt}

To create an H2OFrame object from a Python list:
\lstinputlisting[style=pythoncode, firstline=48, lastline=61]{python/ipython_dataprep_output_fixed.txt}

\newpage

To create an H2OFrame object from  \texttt{collections.OrderedDict} or a Python dict:
\lstinputlisting[style=pythoncode, firstline=63, lastline=76]{python/ipython_dataprep_output_fixed.txt}

To create an H2OFrame object from a Python dict and specify the column types:
\lstinputlisting[style=pythoncode, firstline=78, lastline=93]{python/ipython_dataprep_output_fixed.txt}

To display the column types:
\lstinputlisting[style=pythoncode, firstline=95, lastline=96]{python/ipython_dataprep_output_fixed.txt}

\newpage

\subsection{Viewing Data}
To display the top and bottom of an H2OFrame:
\lstinputlisting[style=pythoncode, firstline=98, lastline=128]{python/ipython_dataprep_output_fixed.txt}


To display the column names:
\lstinputlisting[style=pythoncode, firstline=130, lastline=131]{python/ipython_dataprep_output_fixed.txt}

To display compression information, distribution (in multi-machine clusters), and summary statistics of your data:


\lstinputlisting[style=pythoncode, firstline=133, lastline=161]{python/ipython_dataprep_output_fixed.txt}


\subsection{Selection}
To select a single column by name, resulting in an H2OFrame:
\lstinputlisting[style=pythoncode, firstline=162, lastline=174]{python/ipython_dataprep_output_fixed.txt}

To select a single column by index, resulting in an H2OFrame:
\lstinputlisting[style=pythoncode, firstline=175, lastline=187]{python/ipython_dataprep_output_fixed.txt}

\newpage
To select multiple columns by name, resulting in an H2OFrame:
\lstinputlisting[style=pythoncode, firstline=188, lastline=200]{python/ipython_dataprep_output_fixed.txt}

To select multiple columns by index, resulting in an H2OFrame:
\lstinputlisting[style=pythoncode, firstline=201, lastline=213]{python/ipython_dataprep_output_fixed.txt}

To select multiple rows by slicing, resulting in an H2OFrame: 

\textbf{Note}: By default, H2OFrame selection is for columns, so to slice by rows
and get all columns, be explicit about selecting all columns:
\lstinputlisting[style=pythoncode, firstline=215, lastline=222]{python/ipython_dataprep_output_fixed.txt}

To select rows based on specific criteria, use Boolean masking:
\lstinputlisting[style=pythoncode, firstline=224, lastline=228]{python/ipython_dataprep_output_fixed.txt}
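
To illustrate the idea behind Boolean masking independently of H2O, the following plain-Python sketch builds a mask from a condition and keeps only the rows where the mask is true. The row data and the \texttt{select\_rows} helper are hypothetical; an H2OFrame performs the equivalent selection on the cluster.

\begin{lstlisting}[style=pythoncode]
def select_rows(rows, mask):
    """Keep the rows whose corresponding mask entry is true."""
    if len(rows) != len(mask):
        raise ValueError("mask must have one entry per row")
    return [row for row, keep in zip(rows, mask) if keep]


rows = [
    {"A": 1, "B": "a"},
    {"A": -2, "B": "b"},
    {"A": 3, "B": "c"},
]

# One Boolean per row: True where column A is positive.
mask = [row["A"] > 0 for row in rows]
positive = select_rows(rows, mask)
print(positive)
\end{lstlisting}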


\subsection{Missing Data}
The H2O parser can handle many different representations of missing data, including `' (blank),
`NA',  and None (Python).  They are all displayed as \texttt{NaN} in Python.

To create an H2OFrame from Python with missing elements:
\lstinputlisting[style=pythoncode, firstline=230, lastline=246]{python/ipython_dataprep_output_fixed.txt}

To determine which rows are missing data for a given column (`1' indicates missing):
\lstinputlisting[style=pythoncode, firstline=248, lastline=255]{python/ipython_dataprep_output_fixed.txt}

To change all missing values in a column to a different value:
\lstinputlisting[style=pythoncode, firstline=259, lastline=266]{python/ipython_dataprep_output_fixed.txt}

\newpage
To determine the locations of all missing data in an H2OFrame:
\lstinputlisting[style=pythoncode, firstline=268, lastline=275]{python/ipython_dataprep_output_fixed.txt}
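
The missing-value operations in this section (flagging missing entries with `1' and replacing them with another value) can be sketched in plain Python as follows. This is only a conceptual illustration under the assumption that blank strings, `NA', \texttt{None}, and \texttt{NaN} all count as missing; it is not how the H2O parser is implemented, and the helper names are hypothetical.

\begin{lstlisting}[style=pythoncode]
import math

# Text tokens treated as missing in this sketch (an assumption).
MISSING_TOKENS = {"", "NA"}


def is_missing(value):
    """Return True if the value represents missing data."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value in MISSING_TOKENS


def missing_mask(column):
    """Return 1 for each missing entry and 0 otherwise."""
    return [1 if is_missing(v) else 0 for v in column]


def fill_missing(column, replacement):
    """Return a copy of the column with missing entries replaced."""
    return [replacement if is_missing(v) else v for v in column]


column = [1.0, None, "NA", "", 4.0]
print(missing_mask(column))
print(fill_missing(column, 0.0))
\end{lstlisting}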

\subsection{Operations}
When performing a descriptive statistic on an entire H2OFrame, missing data is generally excluded
and the operation is only performed on the columns of the appropriate data type:
\lstinputlisting[style=pythoncode, firstline=277, lastline=287]{python/ipython_dataprep_output_fixed.txt}

When performing a descriptive statistic on a single column of an H2OFrame, missing data is generally 
\textit{not} excluded:
\lstinputlisting[style=pythoncode, firstline=289, lastline=293]{python/ipython_dataprep_output_fixed.txt}

In both examples, a native Python object is returned (a list and a float, respectively).

When applying a function to each column of the data, an H2OFrame containing the result for each column is returned. The following example returns the mean of each column:
\lstinputlisting[style=pythoncode, firstline=295, lastline=303]{python/ipython_dataprep_output_fixed.txt}

When applying a function to each row of the data, an H2OFrame containing the result for each row is returned. The following example returns the sum across all columns for each row:
\lstinputlisting[style=pythoncode, firstline=305, lastline=317]{python/ipython_dataprep_output_fixed.txt}

H2O provides many methods for histogramming and discretizing data.
Here is an example using the {\texttt{hist}} method on a single data frame:
\lstinputlisting[style=pythoncode, firstline=319, lastline=337]{python/ipython_dataprep_output_fixed.txt}

H2O includes a set of string processing methods in the H2OFrame class
that make it easy to operate on each element in an H2OFrame.  

To determine the number of times a string is contained in each element:
\lstinputlisting[style=pythoncode, firstline=340, lastline=361]{python/ipython_dataprep_output_fixed.txt}

To replace the first occurrence of `l' (lower case letter) with `x' and return a new H2OFrame:
\lstinputlisting[style=pythoncode, firstline=364, lastline=372]{python/ipython_dataprep_output_fixed.txt}

For global substitution, use {\texttt{gsub}}.  Both {\texttt{sub}} and {\texttt{gsub}} support regular expressions.

To split strings based on a regular expression:
\lstinputlisting[style=pythoncode, firstline=374, lastline=382]{python/ipython_dataprep_output_fixed.txt}


\subsection{Merging}
To combine two H2OFrames by appending one to the other as rows and return a new H2OFrame:
\lstinputlisting[style=pythoncode, firstline=384, lastline=406]{python/ipython_dataprep_output_fixed.txt}

For successful row binding, the column names and column types between the two H2OFrames must match.

H2O also supports merging two frames by matching column names:
\lstinputlisting[style=pythoncode, firstline=425, lastline=451]{python/ipython_dataprep_output_fixed.txt}
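
To make ``merging by matching column names'' concrete, the following plain-Python sketch joins two small tables on whichever column names they share. It implements an inner join only, which is an assumption made for brevity; which unmatched rows H2O keeps depends on the options passed to its merge method. The \texttt{merge\_on\_shared\_columns} helper is hypothetical.

\begin{lstlisting}[style=pythoncode]
def merge_on_shared_columns(left, right):
    """Inner-join two lists of dict rows on the column names they share."""
    if not left or not right:
        return []
    keys = sorted(set(left[0]) & set(right[0]))
    if not keys:
        raise ValueError("the two tables share no column names")

    # Index the right-hand rows by their values in the shared columns.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    merged = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            combined = dict(row)
            combined.update(match)
            merged.append(combined)
    return merged


left = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
right = [{"id": 2, "y": 10}, {"id": 3, "y": 20}]
print(merge_on_shared_columns(left, right))
\end{lstlisting}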

\newpage

\subsection{Grouping}

``Grouping'' refers to the following process:

\begin{itemize}
\item splitting the data into groups based on some criteria 
\item applying a function to each group independently
\item combining the results into an H2OFrame
\end{itemize}
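
The three steps above (split, apply, combine) can be sketched in plain Python as follows. The column names and the \texttt{group\_apply} helper are hypothetical and serve only to illustrate the process; H2O performs the same steps in a distributed fashion.

\begin{lstlisting}[style=pythoncode]
from collections import defaultdict


def group_apply(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's `value` entries."""
    groups = defaultdict(list)
    for row in rows:                      # split
        groups[row[key]].append(row[value])
    # apply the function to each group and combine into one table
    return [{key: k, value: func(v)} for k, v in sorted(groups.items())]


rows = [
    {"A": "foo", "C": 1.0},
    {"A": "bar", "C": 2.0},
    {"A": "foo", "C": 3.0},
]
print(group_apply(rows, "A", "C", sum))
\end{lstlisting}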

To group and then apply a function to the results:
\lstinputlisting[style=pythoncode, firstline=453, lastline=480]{python/ipython_dataprep_output_fixed.txt}

To group by multiple columns and then apply a function:
\lstinputlisting[style=pythoncode, firstline=487, lastline=497]{python/ipython_dataprep_output_fixed.txt}

\newpage

To join the results into the original H2OFrame:
\lstinputlisting[style=pythoncode, firstline=499, lastline=509]{python/ipython_dataprep_output_fixed.txt}

\subsection{Using Date and Time Data} 
H2O has powerful features for ingesting time data and using it in feature engineering.  Internally, H2O
stores time information as an integer number of milliseconds since the epoch.
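
As a small illustration of the ``milliseconds since the epoch'' representation, the following sketch converts a timestamp to that integer form and back using only the Python standard library. It assumes the Unix epoch (1970-01-01 00:00:00 UTC) and timezone-aware UTC timestamps; the helper names are hypothetical.

\begin{lstlisting}[style=pythoncode]
from datetime import datetime, timedelta, timezone

# Assumed reference point: the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to integer milliseconds."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis):
    """Convert integer milliseconds since the epoch back to a datetime."""
    return EPOCH + timedelta(milliseconds=millis)


stamp = datetime(2015, 11, 1, 12, 30, tzinfo=timezone.utc)
millis = to_epoch_millis(stamp)
restored = from_epoch_millis(millis)

print(millis)
print(restored.day)        # day of the month
print(restored.weekday())  # day of the week (Monday is 0)
\end{lstlisting}

Storing time as a single integer makes features such as the day of the month or the day of the week simple derived columns.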

To ingest time data natively, use one of the supported time input formats:
\lstinputlisting[style=pythoncode, firstline=511, lastline=518]{python/ipython_dataprep_output_fixed.txt}

To display the day of the month:
\lstinputlisting[style=pythoncode, firstline=520, lastline=525]{python/ipython_dataprep_output_fixed.txt}

To display the day of the week:
\lstinputlisting[style=pythoncode, firstline=527, lastline=532]{python/ipython_dataprep_output_fixed.txt}

\subsection{Categoricals}
H2O handles categorical (also known as enumerated or factor) values in an H2OFrame.  This is significant because categorical
columns have specific treatments in each of the machine learning algorithms.

Using `df12' from above, H2O imports columns A and B as categorical/enumerated/factor types:
\lstinputlisting[style=pythoncode, firstline=534, lastline=536]{python/ipython_dataprep_output_fixed.txt}

To determine if any column is a categorical/enumerated/factor type:
\lstinputlisting[style=pythoncode, firstline=537, lastline=538]{python/ipython_dataprep_output_fixed.txt}

To view the categorical levels in a single column:
\lstinputlisting[style=pythoncode, firstline=540, lastline=541]{python/ipython_dataprep_output_fixed.txt}

To create categorical interaction features:
\lstinputlisting[style=pythoncode, firstline=543, lastline=555]{python/ipython_dataprep_output_fixed.txt}

\newpage
To retain the most common categories and set the remaining categories to a common `Other' category
 and create an interaction of a categorical column with itself:
\lstinputlisting[style=pythoncode, firstline=557, lastline=571]{python/ipython_dataprep_output_fixed.txt}

These can then be added as a new column on the original dataframe:
\lstinputlisting[style=pythoncode, firstline=573, lastline=585]{python/ipython_dataprep_output_fixed.txt}
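
The two ideas used above, keeping only the most common categories and combining categorical columns into an interaction feature, can be sketched in plain Python as follows. The helper names and the underscore used to join level names are assumptions for illustration and may differ from the level names H2O generates.

\begin{lstlisting}[style=pythoncode]
from collections import Counter


def keep_top_levels(values, max_levels, other="Other"):
    """Keep the `max_levels` most frequent levels; map the rest to `other`."""
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(max_levels)}
    return [v if v in keep else other for v in values]


def interact(column_a, column_b, sep="_"):
    """Combine two categorical columns into one interaction column."""
    return [a + sep + b for a, b in zip(column_a, column_b)]


values = ["a", "a", "a", "b", "b", "c", "d"]
print(keep_top_levels(values, 2))
print(interact(["x", "y"], ["p", "q"]))
\end{lstlisting}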

\subsection{Loading and Saving Data}
In addition to loading data from Python objects, H2O can load data directly from:
\begin{itemize}
\item disk
\item network file systems (NFS, S3)
\item distributed file systems (HDFS)
\item HTTP addresses
\end{itemize}

\newpage

 H2O currently supports the following file types:

\begin{frame}%no line table for list of 6+ items

\begin{tabular}{p{5.5cm}p{5.5cm}}

\begin{itemize}
\item CSV (delimited) files
\item ORC
\item SVMLight
\end{itemize} &

\begin{itemize}
\item ARFF
\item XLS
\item XLSX
\end{itemize}

\end{tabular}
\end{frame}


To load data from the same machine running H2O:
\lstinputlisting[style=pythoncode, firstline=591, lastline=592]{python/ipython_dataprep_output_fixed.txt}

To load data from the machine running Python to the machine running H2O:
\lstinputlisting[style=pythoncode, firstline=593, lastline=594]{python/ipython_dataprep_output_fixed.txt}

To save an H2OFrame on the machine running H2O:
\lstinputlisting[style=pythoncode, firstline=595, lastline=596]{python/ipython_dataprep_output_fixed.txt}

To save an H2OFrame on the machine running Python:
\lstinputlisting[style=pythoncode, firstline=597, lastline=598]{python/ipython_dataprep_output_fixed.txt}

\section{Machine Learning}

The following sections describe some common model types and features. 


\subsection{Modeling}
The following section describes the features and functions of some common models available in H2O.  For more information about running these models in Python using H2O, refer to the documentation on
the H2O.ai website or to the booklets on specific models.


H2O supports the following models:  


\begin{frame}%no line table for list of 6+ items

\begin{tabular}{p{5.5cm}p{5.5cm}}

\begin{itemize}
  \item Deep Learning
  \item Na\"{i}ve Bayes
  \item Principal Components Analysis (PCA)
  \item K-means
\end{itemize} &

\begin{itemize}
  \item Generalized Linear Models (GLM) 
  \item Gradient Boosting Machine (GBM)
% \item Generalized Low Rank Model (GLRM) Supported?
  \item Distributed Random Forest (DRF)
\end{itemize}

\end{tabular}

\end{frame}


The list is growing quickly, so check \url{www.h2o.ai} to see the latest additions. 

\subsubsection{Supervised Learning}


{\textbf{Generalized Linear Models (GLM)}}: Provides flexible generalization of ordinary linear regression for response variables with error distribution models other than a Gaussian (normal) distribution. GLM unifies various other statistical models, including Poisson, linear, and logistic regression, and supports $\ell_1$ and $\ell_2$ regularization.

{\textbf{Distributed Random Forest}}: Averages multiple decision trees, each created on different random samples of rows and columns. It is easy to use, non-linear, and provides feedback on the importance of each predictor in the model, making it one of the most robust algorithms for noisy data.

{\textbf{Gradient Boosting (GBM)}}: Produces a prediction model in the form of an ensemble of weak prediction models. It builds the model in a stage-wise fashion and is generalized by allowing an arbitrary differentiable loss function. It is one of the most powerful methods available today.

{\textbf{Deep Learning}}: Models high-level abstractions in data by using non-linear transformations in a layer-by-layer method. Deep learning is typically used for supervised learning, but it can also use unlabeled data that other algorithms cannot.

{\textbf{Na\"{i}ve Bayes}}: Generates a probabilistic classifier that assumes the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. It is often used in text categorization.

\subsubsection{Unsupervised Learning}
{\textbf{K-Means}}: Reveals groups or clusters of data points for segmentation. It partitions observations into $k$ clusters, assigning each observation to the cluster with the nearest mean.

{\textbf{Principal Component Analysis (PCA)}}: The algorithm is carried out on a set of possibly collinear features and performs a transformation to produce a new set of uncorrelated features.

{\textbf{Anomaly Detection}}: Identifies the outliers in your data by invoking the deep learning autoencoder, a powerful pattern recognition model.

\newpage

\subsection{Running Models}
This section describes how to run the following model types:

\begin{itemize}
\item Gradient Boosting Models (GBM)
\item Generalized Linear Models (GLM)
\item K-means
\item Principal Components Analysis (PCA)

\end{itemize}
as well as how to generate predictions.

\subsubsection{Gradient Boosting Models (GBM)}
To generate gradient boosting models for creating forward-learning ensembles,
use {\texttt{H2OGradientBoostingEstimator}}.  

Constructing the estimator
defines its parameters, and the call to
{\texttt{H2OGradientBoostingEstimator.train}}  trains the estimator on
the specified data.  This pattern is common to all of the H2O algorithms.

\lstinputlisting[style=pythoncode, firstline=3, lastline=103]{python/ipython_machine_learning_output_fixed.txt}

To generate a classification model that uses labels, \\ use 
{\texttt{distribution="multinomial"}}:

\lstinputlisting[style=pythoncode, firstline=105, lastline=166]{python/ipython_machine_learning_output_fixed.txt}
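
To make the idea of a stage-wise ensemble of weak learners concrete, the following plain-Python sketch boosts one-split regression stumps on a single feature under squared-error loss. It is a toy illustration, not H2O's implementation; the function names, the learning rate, and the number of stages are all assumptions chosen for the example.

\begin{lstlisting}[style=pythoncode]
def fit_stump(xs, residuals):
    """Find the one-split stump with the lowest squared error.

    Requires at least two distinct values in `xs`.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Fit stumps stage by stage to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_stages):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

    return predict


xs = [1, 2, 3, 4]
ys = [1.0, 1.0, 3.0, 3.0]
model = gradient_boost(xs, ys)
print(model(1), model(4))
\end{lstlisting}

Each stage corrects part of the error left by the previous stages, which is why the ensemble improves even though every individual stump is a weak model.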

\subsubsection{Generalized Linear Models (GLM)}
Generalized linear models (GLM) are some of the most commonly used
models for many types of data analysis use cases. While some data
can be analyzed using linear models, linear models
may not be as accurate if the variables are more complex.
For example, if the dependent variable has a non-continuous
distribution or if the effect of the predictors is not linear,
generalized linear models will produce more accurate results than linear models.

Generalized Linear Models (GLM) estimate regression models for
outcomes following exponential family distributions. In
addition to the Gaussian (i.e., normal) distribution, these include the
Poisson, binomial, gamma, and Tweedie distributions. Each serves
a different purpose and, depending on the choice of distribution and link
function, can be used for either prediction or
classification.

H2O's GLM algorithm fits the generalized linear model with
elastic net penalties. The model fitting computation is distributed,
extremely fast, and scales extremely well for models with a limited
number ($\sim$ low thousands) of predictors with non-zero
coefficients. 

The algorithm can compute models for a single value of a penalty argument or for the full regularization path, similar to glmnet. It can compute Gaussian (linear), logistic, Poisson, and gamma regression models.

To fit a generalized linear model for a response that follows
an exponential family distribution, use
{\texttt{H2OGeneralizedLinearEstimator}}. You can apply
regularization to the model by adjusting the lambda and alpha
parameters.

\lstinputlisting[style=pythoncode, firstline=168, lastline=296]{python/ipython_machine_learning_output_fixed.txt}
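
The roles of the lambda and alpha parameters can be illustrated with a short sketch of the elastic net penalty in its common form, $\lambda\,(\alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2)$. The exact scaling that H2O applies internally is an assumption here, and the \texttt{elastic\_net\_penalty} helper is hypothetical.

\begin{lstlisting}[style=pythoncode]
def elastic_net_penalty(coefficients, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


beta = [1.0, -2.0]
print(elastic_net_penalty(beta, lam=0.5, alpha=1.0))  # pure L1 (lasso)
print(elastic_net_penalty(beta, lam=0.5, alpha=0.0))  # pure L2 (ridge)
\end{lstlisting}

Here lambda controls the overall strength of the regularization, while alpha sets the mix between the $\ell_1$ and $\ell_2$ terms.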

\newpage
\subsubsection{K-means}
To generate a K-means model for data characterization, use
{\texttt{H2OKMeansEstimator}}. This algorithm does not require a
dependent variable.

\lstinputlisting[style=pythoncode, firstline=298, lastline=339]{python/ipython_machine_learning_output_fixed.txt}
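
For intuition about what a K-means model computes, the following plain-Python sketch alternates between assigning each point to its nearest center and moving each center to the mean of its points. It is a toy version with fixed starting centers and a fixed number of iterations, both assumptions made for illustration; it does not reflect how H2O initializes or distributes the computation.

\begin{lstlisting}[style=pythoncode]
def nearest_center(point, centers):
    """Return the index of the center closest to the point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, centers, iterations=10):
    """Alternate assignment and mean-update steps."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(labels)
\end{lstlisting}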

\subsubsection{Principal Components Analysis (PCA)}
To map a set of variables onto a subspace using linear
transformations, use {\texttt{h2o.transforms.decomposition.H2OPCA}}.
This is the first step in Principal Components Regression.
\lstinputlisting[style=pythoncode, firstline=341, lastline=383]{python/ipython_machine_learning_output_fixed.txt}

\newpage

\subsection{Grid Search}
H2O supports grid search across hyperparameters:


\lstinputlisting[style=pythoncode, firstline=385, lastline=423]{python/ipython_machine_learning_output_fixed.txt}
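
Conceptually, a grid search trains one model for every combination of the candidate hyperparameter values. The following sketch enumerates those combinations with the Python standard library; the parameter names \texttt{ntrees} and \texttt{max\_depth} are used only as illustrative examples, and the \texttt{expand\_grid} helper is hypothetical.

\begin{lstlisting}[style=pythoncode]
from itertools import product


def expand_grid(hyper_params):
    """Return one dict per combination of the candidate values."""
    names = sorted(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = expand_grid({"ntrees": [10, 50], "max_depth": [3, 5, 7]})
for params in grid:
    print(params)
\end{lstlisting}

The number of models grows multiplicatively with each added hyperparameter, which is the usual motivation for randomized searches over large grids.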

\subsection{Integration with scikit-learn}
The H2O Python client can be used within scikit-learn pipelines and cross-validation searches.  This extends the capabilities of both H2O and scikit-learn.
\subsubsection{Pipelines}

To create a scikit-learn style pipeline using H2O transformers and estimators:

\lstinputlisting[style=pythoncode, firstline=425, lastline=513]{python/ipython_machine_learning_output_fixed.txt}

\subsubsection{Randomized Grid Search}
To create a scikit-learn style randomized hyperparameter search using k-fold cross-validation:
\lstinputlisting[style=pythoncode, firstline=515, lastline=620]{python/ipython_machine_learning_output_fixed.txt}
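
The two ingredients of this approach, sampling a few candidate settings at random and scoring each on k held-out folds, can be sketched in plain Python as follows. Both helpers are hypothetical illustrations and are not scikit-learn or H2O functions; the interleaved fold assignment is an assumption chosen for simplicity.

\begin{lstlisting}[style=pythoncode]
import random


def k_fold_indices(n_rows, k):
    """Return k (train, holdout) pairs of row indices."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for i, holdout in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds)
                       if j != i for idx in fold)
        splits.append((train, holdout))
    return splits


def sample_candidates(candidates, n_iter, seed=0):
    """Pick `n_iter` distinct candidate settings at random."""
    return random.Random(seed).sample(candidates, n_iter)


splits = k_fold_indices(6, 3)
for train, holdout in splits:
    print(train, holdout)

candidates = [{"max_depth": d} for d in (3, 5, 7, 9)]
print(sample_candidates(candidates, 2))
\end{lstlisting}

Every row is held out exactly once, so each candidate setting is scored on data that was not used to train it.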



\newpage
\section{References}
\bibliographystyle{plainnat}  %alphadin}
\nobibliography{bibliography.bib} %hides entire bibliography so just \bibentry items are included
%use \bibentry{bibname} (where bibname is the entry name in the bibliography) to include entries from bibliography.bib; double brackets {{ are required in .bib file to preserve capitalization

\bibentry{h2osite}

\bibentry{h2odocs}

\bibentry{Pydocs}

\bibentry{h2ogithubrepo}

\bibentry{h2odatasets}

\bibentry{h2ojira}

\bibentry{stream}


\end{document}
